Fixing the Exception Case for AzureStorageOrchestrationService.CompleteTaskOrchestrationWorkItemAsync #1235

sophiatev · 2025-08-20T00:22:11Z

AzureStorageOrchestrationService.CompleteTaskOrchestrationWorkItemAsync is not correctly catching the exception thrown in the case of a conflict when calling the ITrackingStore.UpdateStateAsync method. The InstanceStoreBackedTrackingStore implementation actually does not seem to throw any exceptions (it does an "unsafe" upsert, not using the passed in etag at all), while the AzureTableTrackingStore implementation wraps the currently-caught RequestFailedException exception in a DurableTaskStorageException, which is not being correctly caught. This PR adds this exception check to the method, and removes the no long relevant check for a RequestFailedException exception.

It also adds an additional check for the case of a HttpStatusCode.Conflict being returned - if this is the first work item for the orchestration, then there is no history and etag stored, and we follow this logic with action type TableTransactionActionType.Add. If the work item has already been completed, then the history already exists, and so the storage update fails with a HttpStatusCode.Conflict indicating that the table entity already exists. If this is not the first work item, then there is an etag stored, and the action type is instead TableTransactionActionType.UpdateMerge. In this case, if the work item has already been completed and a new etag stored, then the update fails with a HttpStatusCode.PreconditionFailed.

An example scenario that was previously not being caught, but is now correctly caught:

Worker A obtains a lease on partition 1 and starts working on a work item. Say for whatever reason, worker A does not renew the work item in a timely fashion (which in practice means that the control queue messages for the item become visible again).
Worker B steals the lease on partition 1 from worker A, dequeues the now-visible control queue messages from the partition, and starts working on the same work item.
Worker B completes the work item and calls AzureStorageOrchestrationService.CompleteTaskOrchestrationWorkItemAsync.
Worker A eventually finishes and also attempts to call AzureStorageOrchestrationService.CompleteTaskOrchestrationWorkItemAsync. If this is the first work item for the orchestration, then a DurableTaskStorageException with status code Conflict is thrown. If it is not the first work item, then the status code is PreconditionFailed. In either case, both of these exceptions are now caught and a SessionAbortedException is thrown, which is the intended behavior.

davidmrdavid · 2025-08-26T22:30:01Z

Update: seems this feedback has been incorporated, ty.

non-blocking tip (I'm just an OSS contributor now) - this seems like an involved failure case. I recommend adding in the PR description a short example of how this error case is reached, in the form of a timeline.

For example:
At

t=1, such and such happens.
t=2, then this happens
t=3, this happens, and now we're in an inconsistent state because Y
t=4, the inconsistency is observed, and the exception is thrown.

It might make it easier to determine that the PR is doing what's expected. Though others caught up with the context might not need this level of detail.

src/DurableTask.AzureStorage/AzureStorageOrchestrationService.cs

src/DurableTask.AzureStorage/Tracking/AzureTableTrackingStore.cs

davidmrdavid · 2025-08-26T23:32:12Z

src/DurableTask.AzureStorage/AzureStorageOrchestrationService.cs

                // Precondition failure is expected to be handled internally and logged as a warning.
                // The orchestration dispatcher will handle this exception by abandoning the work item
-                throw new SessionAbortedException("Aborting execution due to failed precondition.", rfe);
+                throw new SessionAbortedException("Aborting execution due to conflicting completion of the work item.", dtse);


nit - if this is possibly customer facing, it's useful to add a little hint here that the error is benign. Consider:

Suggested change

throw new SessionAbortedException("Aborting execution due to conflicting completion of the work item.", dtse);

throw new SessionAbortedException("Aborting execution because the work item was already completed. This can occur when another worker took over processing, and is benign when it occurs rarely", dtse);

I think SessionAbortedExceptions do not surface, though someone else from the team correct me if I'm wrong, please.

I won't insist on this one, it's my personal style to augment these nonetheless but it really doesn't matter. Feel free to dismiss.

test/DurableTask.AzureStorage.Tests/AzureStorageScaleTests.cs

davidmrdavid

Other than my comment changes, and the "partitionCount" log that should be generalized, the behavior here looks right to me.

Great job with the test, I personally don't think I ever had to write a test with such low level control over the orchestration processing state machine, so it's cool to see that in action.

As always - I'm just an OSS contributor, so please wait for someone actively in the team to approve before moving forward. Just wanted to signal that behavior-wise this seems correct, and that test seems complete.

Update: I cleared my approval just to avoid accidentally confusing the team into thinking the PR is already approved by a team member.

Co-authored-by: David Justo <[email protected]>

first commit

5f89929

sophiatev changed the title ~~Adding an Exception Case to CompleteTaskOrchestrationWorkItemAsync~~ Adding an Exception Case to AzureStorageOrchestrationService.CompleteTaskOrchestrationWorkItemAsync Aug 20, 2025

finally added an integration test

14af23a

sophiatev changed the title ~~Adding an Exception Case to AzureStorageOrchestrationService.CompleteTaskOrchestrationWorkItemAsync~~ Fixing the Exception Case for AzureStorageOrchestrationService.CompleteTaskOrchestrationWorkItemAsync Aug 26, 2025